[SPARK-54656][SQL] Refactor SupportsPushDownVariants to be a ScanBuilder mix-in #53276
Conversation
I personally would want the following changes to VARIANT pushdown in DSv2:

Basically, consider the following example: This should pass the following extractions to the connector: Connectors mark

Currently Although you said "each

I think connectors can still read and parse the variant into the required type even if it is not a shredded variant. From the view of Spark and the DSv2 API, we don't need to know how connectors fulfill the pushdown requirement.

I expect each

I feel this is VERY dangerous. I read through the casting logic in Spark. It has so many edge cases. There is no way connectors will replicate this behavior. We don't want inconsistent extraction behavior across connectors. In the future, we may add a casting function to
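To illustrate the kind of cast edge case at stake here (an illustrative example, not taken from this PR): the same cast expression behaves differently depending on `spark.sql.ansi.enabled`, and a connector performing the variant-to-type cast itself would have to replicate every such rule exactly.

```scala
// Illustrative only: one of the many casting edge cases a connector would
// need to mirror if it performed the cast on its side.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()  // returns NULL

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('abc' AS INT)").show()  // fails with CAST_INVALID_INPUT
```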
Okay, I'm updating this PR according to these ideas. I will update it soon.

Does the latest proposal make sense to you too, @cloud-fan @gengliangwang @dongjoon-hyun? Any thoughts?

It would be great if we could see tangible code to review, @aokolnychyi. Could you update the PR when it's ready, @viirya?

I will update this once it is ready.

One more question I forgot: do we need to make

Given the current status, I made a PR to
…tal`

### What changes were proposed in this pull request?

This PR aims to mark `SupportsPushDownVariants` as `Experimental` instead of `Evolving` in Apache Spark 4.1.x.

### Why are the changes needed?

During Apache Spark 4.1.0 RC2, it turns out that this new `Variant` improvement feature still needs more time to stabilize.

- #52522
- #52578
- #53276
- [[VOTE] Release Spark 4.1.0 (RC2)](https://lists.apache.org/thread/og4dn0g7r92qj22fdsmqoqs518k324q5)

We had better mark this interface itself as `Experimental` in Apache Spark 4.1.0 while keeping it `Evolving` in the `master` branch.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53354 from dongjoon-hyun/SPARK-54616.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Hmm, for the built-in ParquetScan, it is not required because it doesn't send But probably it is required to be

Thank you for updating the PR, @viirya.
```java
 * @since 4.1.0
 */
@Experimental
public final class VariantExtractionImpl implements VariantExtraction, Serializable {
```
Does it need to be a public API?
This can be internal, but that means every data source implementation needs to implement its own VariantExtraction. I think we can provide one implementation?
```scala
case nestedStruct: StructType =>
  field.name +: getColumnName(nestedStruct, rest)
case _ =>
  throw new IllegalArgumentException(
```
we should throw SparkException.internal
Ok
You mean SparkException.internalError, right?
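For context, a minimal sketch of the suggested change (the surrounding names `getColumnName`, `field`, and `rest` come from the snippet above; the error message is illustrative, not the PR's actual text):

```scala
import org.apache.spark.SparkException

// Sketch of the suggested fix: raise Spark's internal-error helper instead
// of a plain IllegalArgumentException for this unreachable case.
case nestedStruct: StructType =>
  field.name +: getColumnName(nestedStruct, rest)
case other =>
  throw SparkException.internalError(
    s"Unexpected type while resolving variant column path: $other")
```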
```java
 * Pushes down variant field extractions to the data source.
 * <p>
 * Each element in the input array represents one field extraction operation from a variant
 * column. Data sources should examine each extraction and determine whether it can be
```
Does Spark do any reconciliation? E.g., `variant_get(v, '$.a', 'int')` and `variant_get(v, '$.b', 'int')` will be two extractions, but how about `variant_get(v, '$.a', 'struct<x int, y int>')` and `variant_get(v, '$.a.x', 'int')`? Are they two extractions as well?
Yes, they are two extractions. Each separate get on a variant will be an extraction.
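An illustrative query for this exchange (the table `events` and its variant column `v` are hypothetical):

```scala
// Hypothetical example: both variant_get calls below become separate
// extractions, even though '$.a.x' is a sub-path of '$.a'; per the answer
// above, Spark does not merge overlapping paths before pushdown.
spark.sql("""
  SELECT
    variant_get(v, '$.a', 'struct<x int, y int>') AS a,
    variant_get(v, '$.a.x', 'int')                AS a_x
  FROM events
""")
```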
dongjoon-hyun left a comment
Please update the PR title with a valid JIRA ID, @viirya.
Thank you, @dongjoon-hyun. I updated it.

Thank you so much, @viirya.
```java
 * @since 4.1.0
 */
@Experimental
public final class VariantExtractionImpl implements VariantExtraction {
```
I don't see the value of separating the interface and the impl if we need to expose the impl anyway. Can we follow TableInfo and make VariantExtraction a class directly?
But let me confirm again: the API means Spark passes VariantExtraction instances to the data source, and data sources need to return a bool. Why do data sources need to use the impl class?
Oh, you're right. We don't need to expose VariantExtractionImpl but only the interface.
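A rough sketch of the split this lands on. The member names `columnName` and `expectedDataType` come from the implementation's `require` checks quoted further down; their types here, and the connector-side helper, are assumptions for illustration only.

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Sketch of the agreed shape: the public API exposes only the
// VariantExtraction interface; Spark builds the internal impl, so
// connectors never construct one themselves.
trait VariantExtraction {
  def columnName: Array[String]  // path to the variant column in the schema
  def expectedDataType: DataType // type the extraction must produce
}

// Connector side: consumes the interface only, e.g. to decide pushability.
def canPush(e: VariantExtraction): Boolean =
  !e.expectedDataType.isInstanceOf[StructType]
```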
```java
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources;
```
How about this package instead? https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector
Okay. I moved it there and made it a Scala file instead of a Java file.
```scala
require(expectedDataType != null, "expectedDataType cannot be null")
require(columnName.nonEmpty, "columnName cannot be empty")

override def equals(obj: Any): Boolean = obj match {
```
I think Scala case classes already implement these basic methods.
Removed. Thanks.
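For reference, a standalone example (not the PR's actual class) of the compiler-generated methods that made the hand-written ones redundant:

```scala
// Scala case classes get structural equals, hashCode, toString, and copy
// generated by the compiler, so overriding equals by hand is unnecessary.
case class Point(x: Int, y: Int)

val a = Point(1, 2)
val b = Point(1, 2)
assert(a == b)                    // structural equality, compiler-generated
assert(a.hashCode == b.hashCode)  // consistent with equals
```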
Do you want to address Wenchen's last comment, @viirya?

Yea, addressed it. Thanks.
dongjoon-hyun left a comment
Thank you, @viirya, @cloud-fan, @gengliangwang.
Pending CIs.
Could you remove the unused import too, @viirya?

I removed it, @viirya.

Okay. Thanks.
What changes were proposed in this pull request?
SupportsPushDownVariants was a Scan mix-in in #52578. This patch changes it to be a ScanBuilder mix-in to follow the established patterns in the codebase.
Why are the changes needed?
SupportsPushDownVariants was a Scan mix-in in #52578. This patch changes it to be a ScanBuilder mix-in to follow the established pushdown patterns in the codebase, e.g., join pushdown and aggregate pushdown, as sketched below.
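A hedged sketch of the resulting shape. The method name `pushVariantExtractions` and its exact signature are assumptions drawn from the discussion above, not the verbatim API; `ScanBuilder` and `VariantExtraction` are the real DSv2 types referenced in this PR.

```scala
import org.apache.spark.sql.connector.read.{ScanBuilder, VariantExtraction}

// Assumed shape only: like SupportsPushDownFilters, the mix-in now sits on
// ScanBuilder rather than on Scan, so pushdown is negotiated while the scan
// is being built.
trait SupportsPushDownVariants extends ScanBuilder {
  // For each requested extraction, report whether this source can produce it;
  // Spark evaluates the ones the source cannot handle.
  def pushVariantExtractions(extractions: Array[VariantExtraction]): Array[Boolean]
}
```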
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code v2.0.14